Putting Book Scans PDFs in Scrapbox 2019 - NISHIO Hirokazu's Scrapbox (Auto-translated from Japanese)

2019-10-08

Putting book-scan PDFs in Scrapbox

https://www.facebook.com/toshiyukimasui/posts/10157675595687498

There is a Gyazo gem

masui/Book2Scrapbox: a system for reading self-scanned books in Scrapbox

The PDF is split into images, which a script then uploads to Gyazo Pro

Gyazo Pro uses Google Cloud Platform's Cloud Vision API for OCR.

OCR takes time, so the OCR data is fetched some time after the upload.
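The "fetch it later" step can be sketched as a small polling loop. This is only an illustration: `fetch_image` is a hypothetical callable that returns the metadata for one image as a dict, and the `"ocr"` / `"description"` key layout is an assumption about that metadata, not something stated on this page.

```python
import time
from typing import Callable, Optional


def wait_for_ocr(
    fetch_image: Callable[[str], dict],
    image_id: str,
    attempts: int = 5,
    delay_seconds: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Poll image metadata until OCR text appears, or give up.

    fetch_image is a hypothetical callable returning the metadata dict
    for one image. The "ocr" -> "description" layout is an assumption.
    """
    for attempt in range(attempts):
        metadata = fetch_image(image_id)
        text = (metadata.get("ocr") or {}).get("description")
        if text:
            return text
        if attempt < attempts - 1:
            sleep(delay_seconds)  # OCR runs asynchronously, so wait and retry
    return None
```

Passing `sleep` in as a parameter keeps the loop easy to try out without actually waiting.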

Notes from reading https://github.com/masui/Book2Scrapbox

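Page images are extracted from the ScanSnap scan PDF with pdfimages.

Related: PDF to PNG conversion.

This works for a PDF made by cutting a book apart and scanning it.

It does not work for PDFs of slides and the like, because pdfimages only extracts embedded raster images and slide pages are typically vector graphics.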
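A minimal sketch of the extraction step, assuming pdfimages (from Poppler) is on the PATH. The function names and the `page` output prefix are made up for illustration; Book2Scrapbox itself may invoke the tool differently.

```python
import subprocess
from pathlib import Path
from typing import List


def pdfimages_command(pdf_path: str, output_prefix: str) -> List[str]:
    """Build the pdfimages call; -j writes embedded JPEG images as .jpg files."""
    return ["pdfimages", "-j", pdf_path, output_prefix]


def extract_page_images(pdf_path: str, output_dir: str) -> List[Path]:
    """Extract embedded images into output_dir and return them in sorted order."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    # The "page" prefix is a hypothetical choice; pdfimages appends a counter.
    subprocess.run(pdfimages_command(pdf_path, str(out / "page")), check=True)
    return sorted(out.glob("page-*"))
```

For a scanned book each page is one embedded image, so the extracted files correspond to pages. For a slide PDF the same call would only return whatever bitmaps happen to be embedded in the slides.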

Locally, folders are created and the files are stored by MD5 hash.

The local files are then synced to AWS (S3).
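Storing files by MD5 hash might look roughly like the following. The two-level folder layout in `hashed_path` is a hypothetical example, since the page does not say how the folders are named.

```python
import hashlib
from pathlib import Path


def md5_of_file(path) -> str:
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def hashed_path(root, md5_hex: str, suffix: str = ".jpg") -> Path:
    """Map a digest to a storage path.

    Hypothetical layout: <root>/<1st hex char>/<2nd hex char>/<md5><suffix>
    """
    return Path(root) / md5_hex[0] / md5_hex[1] / (md5_hex + suffix)
```

Naming files by content hash means the same page image always lands at the same path, so re-running the script does not create duplicates.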

The AWS Command Line Interface (CLI: a unified tool for managing AWS services) must be installed

Installing the AWS CLI - AWS Command Line Interface

The instructions are very thorough.

AWS CLI Configuration - AWS Command Line Interface

Deleting a file locally does not delete anything on S3.

Syncing to AWS is not actually required,

because the contents of the file are sent to Gyazo directly.
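For reference, the sync step corresponds to an `aws s3 sync` call, sketched here as a command builder. The bucket and prefix names are placeholders. By default `aws s3 sync` leaves remote objects in place when the local file is gone; the `--delete` flag is what opts in to removing them.

```python
from typing import List


def s3_sync_command(
    local_dir: str, bucket: str, prefix: str = "", delete: bool = False
) -> List[str]:
    """Build an `aws s3 sync` command line.

    bucket and prefix are placeholders for illustration. Without --delete,
    objects removed locally stay on S3.
    """
    destination = f"s3://{bucket}/{prefix}".rstrip("/")
    command = ["aws", "s3", "sync", local_dir, destination]
    if delete:
        command.append("--delete")
    return command
```

The resulting list can be passed to `subprocess.run(command, check=True)` once the AWS CLI is installed and configured.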

https://github.com/nishio/Book2Scrapbox

Uses pdftocairo, since slides cannot be converted to images with pdfimages

$ pdftocairo -r 200 -f 0 -jpeg <pdf> pages

See: PDF to PNG conversion
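The same rendering step could be driven from Python along these lines. This is a sketch, not the code in the repository: it mirrors the `-r` and `-jpeg` options of the command above and leaves out `-f`, so pdftocairo starts from its default first page. `page_number` is a hypothetical helper for ordering the output files by their numeric suffix.

```python
import re
from typing import List


def pdftocairo_command(pdf_path: str, output_root: str = "pages", dpi: int = 200) -> List[str]:
    """Build a pdftocairo call that renders every page to a JPEG at the given DPI."""
    return ["pdftocairo", "-r", str(dpi), "-jpeg", pdf_path, output_root]


def page_number(filename: str) -> int:
    """Extract the page number from an output name such as pages-12.jpg."""
    match = re.search(r"-(\d+)\.jpg$", filename)
    if match is None:
        raise ValueError(f"unexpected output name: {filename}")
    return int(match.group(1))
```

Sorting with `sorted(names, key=page_number)` keeps the pages in order whether or not the numbers are zero-padded.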

Multiple PDFs are now combined into a single JSON

pdfstojson.rb calls makejson.rb

I looked into doing this in Python, but it turned out to be achievable by running makejson.rb as a child process.
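The child-process approach, rendered in Python purely for illustration (this is not pdfstojson.rb). It assumes that makejson.rb takes a PDF path as its only argument and prints Scrapbox import JSON of the form `{"pages": [...]}` to stdout; the real script's interface may differ.

```python
import json
import subprocess
from typing import Callable, Iterable, List


def run_makejson(pdf_path: str) -> str:
    """Run makejson.rb as a child process and return its stdout.

    Assumption: makejson.rb accepts the PDF path as its only argument
    and prints Scrapbox import JSON to stdout.
    """
    result = subprocess.run(
        ["ruby", "makejson.rb", pdf_path],
        check=True, capture_output=True, text=True,
    )
    return result.stdout


def merge_scrapbox_json(
    pdf_paths: Iterable[str], run: Callable[[str], str] = run_makejson
) -> dict:
    """Concatenate the "pages" arrays from each per-PDF JSON into one document."""
    pages: List[dict] = []
    for pdf_path in pdf_paths:
        pages.extend(json.loads(run(pdf_path)).get("pages", []))
    return {"pages": pages}
```

Because `run` is a parameter, the merge logic can be exercised with canned JSON strings without Ruby installed.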

Some time after the JSON is ready, the OCR results are downloaded from Gyazo and added to it.

---

This page is auto-translated from /nishio/書籍スキャンPDFをScrapboxに置く2019 using DeepL. If you find something interesting but the auto-translated English is not good enough to understand it, feel free to let me know at @nishio_en. I'm very happy to spread my thoughts to non-Japanese readers.